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ABSTRACT 

Global run-on sequencing (GRO-seq) is a recent 
addition to the series of high-throughput 
sequencing methods that enables new insights 
into transcriptional dynamics within a cell. 
However, GRO-sequencing presents new algorith- 
mic challenges, as existing analysis platforms for 
ChlP-seq and RNA-seq do not address the unique 
problem of identifying transcriptional units c/e noi^o 
from short reads located all across the genome. 
Here, we present a novel algorithm for de novo tran- 
script identification from GRO-sequencing data, 
along with a system that determines transcript 
regions, stores them in a relational database and 
associates them with known reference annotations. 
We use this method to analyze GRO-sequencing 
data from primary mouse macrophages and derive 
novel quantitative insights into the extent and char- 
acteristics of non-coding transcription in mamma- 
lian cells. In doing so, we demonstrate that Vespucci 
expands existing annotations for mRNAs and 
lincRNAs by defining the primary transcript beyond 
the polyadenylation site. In addition, Vespucci gen- 
erates assemblies for un-annotated non-coding 
RNAs such as those transcribed from enhancer- 
like elements. Vespucci thereby provides a robust 
system for defining, storing and analyzing diverse 
classes of primary RNA transcripts that are of 
increasing biological interest. 



INTRODUCTION 

High-throughput sequencing has opened up a new 
window into transcriptional biology and the complex 
regulatory networks that define RNA and DNA inter- 
actions. Global run-on sequencing (GRO-seq) (1) is a 
recent addition to the series of sequencing-based 
methods that holds particular promise for understanding 
real-time transcriptional behavior. GRO-seq captures a 
point-in-time snapshot of active transcription genome- 
wide and returns data on the position, length and orien- 
tation of nascent transcripts. 

This sequencing technique is now being used to inspect 
the nature of transcriptional regulation in a number of 
experimental conditions (1^). The capture of nascent 
transcripts in each of these conditions reveals a variety 
of RNA species beyond the standard set derived from 
genes encoding proteins and microRNAs, including 
enhancer RNA (eRNA), long intergenic RNA 
(hncRNA) (2) and promoter-associated RNA (1,5). 
GRO-seq thus offers unprecedented insight into the gen- 
eration of a vast repertoire of non-coding transcripts that 
are of potential functional significance. 

The data collected, however, are both immense 
and unique; each experiment yields tens of miUions of 
strand-specific short RNA reads across the entire 
genome. This new sequencing method presents a new al- 
gorithmic challenge, as the peak-calling and exonic RNA 
identification techniques developed for other sequencing 
methods do not address the particular output of GRO- 
seq. Unlike ChlP-seq, peaks are not the primary unit of 
output, and, unhke RNA-seq, nascent transcripts can be 
anywhere, so relying on previously annotated regions such 



*To wliom correspondence should be addressed. Tel: +1 858 534 6011; Fax: +1 858 822 2127; Email: ckg(a; ucsd.edu 
© The Author(s) 2013. Pubhshed by Oxford University Press. 

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.Org/Hcenses/by/3.0/), which 
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. 



2434 Nucleic Acids Research, 2014, Vol. 42, No. 4 



as NCBI Reference Sequence (RefSeq) (6) or microRNA 
genes is insufficient. 

To take full advantage of this novel data, regions 
beyond existing annotations must be considered. Units 
of transcription must be inferred de novo from the short 
read output of GRO-sequencing experiments. Existing 
analysis of GRO-seq data relies largely on adaptations 
of RNA-sequencing analysis techniques, with expression 
levels calculated from tag counts over gene bodies, pro- 
moters or other explicitly defined genomic regions (1^,7). 
New transcripts can be identified using software such as 
Cufflinks (8), but these rely on assumptions optimized for 
spliced RNA. For example. Cufflinks is optimized for 
paired-end reads, expects uniform density for a given tran- 
script (whereas GRO-seq can reveal pausing and other 
biologically relevant deviations from uniformity) and 
aims to accommodate large gaps (introns) in reads that 
result from splicing rather than from transcriptional 
breaks. In short. Cufflinks and similar exon-focused algo- 
rithms are not suited to distinguish between the sorts 
of small and closely spaced regulatory elements that 
GRO-sequencing reveals. 

Hah et al. have developed a Hidden Markov Model 
(HMM) for identification of regions of transcription spe- 
cifically within GRO-seq data (2). The software demar- 
cates transcripts using a two-state model, calling regions 
either 'transcribed' or 'un-transcribed', and thus is able to 
identify transcripts from GRO-sequencing short reads de 
novo. However, the HMM is optimized to accurately 
retrieve transcript boundaries as defined by RefSeq, result- 
ing in the loss or merging of many of the shorter, non- 
coding RNA transcripts that GRO-sequencing reveals. 
Further, because the software relies on flat files for pro- 
cessing and storage, it is difficult to integrate the called 
transcripts with other types of genomic data, including 
expression levels from each individual GRO-sequencing 
experiment and co-occurring peaks from ChlP-sequencing 
data. 

Here, we provide an algorithm for de novo identification 
of unified transcripts from GRO-seq data, along with an 
implementation that determines transcript regions, stores 
them in a relational database and associates them with 
known reference annotations according to 2D genomic 
overlap. Crucially, this method captures transcript 
boundaries as defined by RefSeq while maintaining the 
abiUty to identify non-coding RNAs at a high resolution 
and even retaining information about relative transcript 
abundance. Further, transcript identification feeds into a 
database that makes downstream integration of other 
datatypes feasible. 

Using this system, we were able to gain new insight into 
the types of nascent RNAs being generated inside primary 
murine macrophages. Although the ENCODE Project has 
begun the process of characterizing mature RNA species 
(9), there is little known about the extent and distribution 
of nascent RNAs, which, unlike mRNAs observed in trad- 
itional RNA-sequencing, include a number of transient 
RNA species that nonetheless play roles in the regulation 
of gene expression (1,10-13). Of particular interest are the 
vast numbers of non-coding RNAs recently found to be 
derived from transcription of active enhancers (14,15). 



The finding that at least some of these eRNAs contribute 
to enhancer function provides impetus for developing 
computational tools to define the sites of initiation of 
these species and their length. Importantly, while the 
start and termination sites of transcripts related to 
mRNA-encoding genes and lincRNAs have for the most 
part been estabUshed by conventional RNA sequencing 
studies, this information is virtually non-existent for 
eRNAs. Furthermore, the ENCODE consortium esti- 
mates that the human genome contains hundreds of thou- 
sands of enhancers (16), the majority of which are selected 
in a cell-specific manner. Therefore, each GRO-seq experi- 
ment in a new cell type results in the identification of tens 
of thousands of previously unannotated eRNAs that are 
derived from transcription of cell-specific enhancers. To 
address this challenge, we developed Vespucci as a com- 
putational method to systematically and quantitatively 
define discreet nascent transcripts from short sequencing 
reads obtained in GRO-seq experiments. By tuning 
parameters for specific types of transcripts, Vespucci 
returns accurate calls for primary mRNAs, while also 
deconvoluting complex patterns of transcription from 
enhancer-rich regions of the genome. Using Vespucci, we 
provide evidence that many nascent niRNA transcripts 
extend well beyond RefSeq annotated termination sites. 
In addition, Vespucci predicts approximately twice as 
many non-coding transcripts as were identified by other 
systems like the Hah et cd. HMM. These findings demon- 
strate the value of Vespucci in integrating disparate data 
types to characterize the variety of RNA species observed. 

The Python and PostgreSQL code, as well as a pre- 
loaded Amazon AMI, have been made available for 
implementation and expansion by interested researchers. 

MATERIALS AND METHODS 

Technical details 

The current implementation allows sample types and 
database schema to be spHt easily by cell type, such that 
the merging of transcripts is confined to a single cell type. 
Still, GRO-sequencing runs from multiple cell types can be 
easily merged together if desired. 

The current codebase assumes a PostgreSQL 9.2 
database installation; Python 2.7+; and Django 1.2+ 
with psycopg2 for database access. The codebase is 
hosted on Github at https://github.com/karmel/vespucci, 
and includes scripts to build both the transcript and the 
annotation databases. Instructions are included within the 
repository. In addition, a pre-loaded Amazon EC2 small 
instance image is available with instructions at https:// 
github.com/karmel/vespucci. 

Cell culture 

Primary cells were isolated from 6 to 8-week-old C57B1/6 
mice. All studies were conducted in accordance with the 
UCSD Institutional Animal Care and Use Committee. 
Thioglycollate-elicited macrophages were isolated by peri- 
toneal lavage 3^ days following peritoneal injection of 
2.5 ml thioglycollate. Cells were plated in RPMI medium 
1640 and 10% fetal bovine serum, washed after adherence 
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and again fed with fresh medium. The following day fresh 
medium containing 0.5% fetal bovine serum was added to 
the cells and serum starvation was carried overnight 

GRO-seq library preparation 

Briefly, GRO-sequencing takes advantage of a nuclear 
run-on reaction to incorporate tagged UTP into ongoing 
transcript synthesis by RNA polymerase. RNAs that 
incorporate the tagged nucleotides can subsequently be 
extracted and sequenced, producing a genome-wide 
library of nascent RNAs. Thus, in contrast to traditional 
RNA-sequencing in which mature stable RNAs are col- 
lected, GRO-sequencing returns short read data for RNAs 
in the act of being transcribed. 

Global run-on (1) and hbrary preparation for 
sequencing (17) were done as described. The protocol 
was performed as described in Wang et al. (3) 

Previously published GRO-seq data 

Four of the replicates were previously published 
under GSE48759 (4). The GEO Accession codes are 
GSM1183906-GSM 1183908 and GSMl 183914. The 
MCF-7 GRO-seq data are available under GSE27463 (2), 
Accession codes GSM678535-GSM678540; and GSE45822 
(12), Accession codes GSMl 115995-GSMl 115998. 

Read mapping and ChlP-Seq data analysis 

Reads were mapped to the mm9 genome using Bowtie2 

(18) with the default ahgnment options (specifically, the 
command bowtie2 -no-unal -x). 

H3K4mel and input data were taken from GSE21512 
(7), Accession codes GSM537986 and GSM537988. MCF- 
7 H3K4me2 data and input were taken from GSE24166 

(19) , Acession codes GSM594606 and GSM594608. Peaks 
were called using HOMER (7) using the command 
findPeaks and the options -nfr -style histone. 

Case study counts 

The SQL queries used to generate the counts for the 
analysis of transcription in macrophages are included as 
a Supplementary File. 

RESULTS 

Defining a transcript 
Principles 

As with most next-generation sequencing-based methods, 
GRO-seq relies on short (35-100 bp) reads. For the 
purposes of this study, we assume each individual read 
is mapped to the canonical genome of the organism in 
question with a standard aligner such as Bowtie (20). 
Any given uniquely mappable read, then, can be placed 
into ID space with a definitive coordinate consisting of 
chromosome, strand, start of read and end of read 
(Figure la). 

The location of each short read does not alone describe 
the relevant units of transcription in the genome; 
overlapping sets of short reads must be computationally 
merged so that they represent the extents of biologically 



relevant transcripts, which here we take to mean linear 
segments of DNA that are transcribed into continuous 
RNA sequences by RNA polymerase II (9). Once 
merged into continuous units, the count of short 
reads mapping to a given unit (transcript) can be used 
to approximate the relative expression level of the tran- 
script (21). 

A primary challenge in any short read RNA sequencing 
application is determining how to merge the fragments 
into unified transcripts. Each type of sequencing presents 
unique challenges in this regard; in the case of mRNA-seq, 
for example, methods have been developed that are 
designed to identify exon junctions (8,22). GRO-seq 
reads, in contrast, are expected to extend through 
intragenic regions, and further are expected to exist 
widely both in intergenic regions and in regions antisense 
to annotated transcripts (1). 

There are numerous patterns of GRO-seq data that are 
important to identify computationally. For example, 
promoter-associated RNA transcripts are generated at 
the promoters of genes, antisense to the genie transcript 
(1) (Figure lb). Any algorithm addressing GRO-seq needs 
to identify these RNAs as distinct units, overlapping with 
but not part of either genie transcripts or nearby eRNAs. 
Similarly, eRNAs are generated bi-directionally at enhan- 
cers (23), and any algorithm must identify each strand of 
eRNA as a separate but contiguous unit (Figure Ic). 

Notably, some transcripts appearing in GRO-seq data 
seem 'obvious' to separate when viewed in the UCSC 
browser, as with Figure lb and c. However, these cases 
are the minority, and, further, any such 'obvious' separ- 
ation is ad hoc and risks inconsistency when performed 
manually; in Figure Id, for example, most observers 
would not separate the transcript, but existing annotation 
data from RefSeq indicates there is an important 
boundary corresponding to a coding sequence. Thus, it 
is useful to have an algorithmic interpretation that 
provides a standardized analysis and additionally can 
appeal to existing annotation data if available (Figure Id). 

If there are no existing annotations from which to 
scaffold the current transcript identification, the algorithm 
must have a standard means of interpreting the short read 
data that is hkely to reflect the biological reality of the 
transcriptional data. To this end, the present implementa- 
tion makes several assumptions based on the expected 
behavior of RNA polymerase: first, that regions that are 
tiled without gaps by short reads from a single sequencing 
run are most likely continuously transcribed; and second, 
that gaps corresponding with a great disparity of read 
counts per basepair likely represent breaks in the path of 
RNA polymerase, with differently regulated transcripts on 
either side. (Supplementary Figure Sla shows a schematic 
of how differential density might yield separate tran- 
scripts, and Figure le shows an example of this in real 
GRO-sequencing data, where transcription along each 
strand in the displayed region is spht into two separate 
transcripts due to differential coverage.) 

These two principles — that overlapping reads should be 
merged and that disparity in the density of reads may war- 
rant separation of otherwise close transcripts — motivate 
the design of the algorithm described later in text. 
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Figure 1. GRO-sequencing reveals transcriptional dynamics in great detail, but can be difficult to interpret, (a) Short reads from GRO-sequencing 
experiments (red) can be mapped back to the reference genome (black) and assigned a genomic coordinate that includes chromosome, strand, starting 
basepair and ending basepair. (b) Promoter-associated RNA (paRNA) overlaps with the 5' end of the Tmbim6 gene, antisense to the gene itself The 
blue bar indicates the transcript that has been identified for Cstb itself, and the leftmost green bar shows the extent of the paRNA. (c) Enhancer 
RNA (eRNA) appears in GRO-sequencing samples (top track) as bi-directional transcripts centered on the binding sites of transcription factors 
(middle track) and marked by H3K4mel (bottom track), (d) Transcription can continue past the 3' ends of annotated RefSeq transcripts, making the 
exact boundary of relevant transcripts difficult to identify. At the MmpI2 locus, manual interpretation could lead to differing interpretations of 

(continued) 



Nucleic Acids Research, 2014, Vol. 42, No. 4 2437 



Implementation 

Given the size of GRO-seq data sets, with every sample 
yielding at least tens of milhons of reads, any algorithm 
must be implemented within a framework that is easily 
maintained and extended, and it must process data 
quickly enough to be useful in a laboratory setting. 
Further, the transcript identification system must be archi- 
tected such that new samples can be incrementally added 
to the full set of data without requiring re-processing of all 
data. To this end, we have developed a Python codebase 
that processes mapped short read files from GRO-seq 
experiments into continuous transcript units, determines 
relative expression levels on a sample-by-sample basis and 
stores the data in a PostgreSQL relational database that 
allows for complex coordinate-based queries over the 
transcriptional data. 

The procedure relies on two key parameters: 

• DENSITY MULTIPLIER: scahng factor to relate 
density to basepairs (see step 4 later in text; default: 
10000). Intuitively, this is the number of basepairs 
over which density is considered, so that a difference 
in one tag per DENSITY MULTIPLIER basepairs 
equates to a one basepair gap in genomic distance. 

• MAX_EDGE: maximum allowed distance in 2D space 
between proto-transcripts to be stitched together (see 
step 5 later in text; default: 500). 

The selection of values for these two parameters 
depends heavily on the desired use case. The larger the 
value of DENSITY_MULTIPLIER, the more density 
matters as compared with distance in basepairs, and the 
larger the value of MAX_EDGE, the more hkely distant 
transcripts are to be merged into single units. Thus, if the 
user desires to focus on large transcripts and genes, he/she 
might choose a low value for DENSITY MULTIPLIER 
and a large value for MAX EDGE. On the other hand, if 
the user desires to focus on small transcripts and ncRNA, 
he/she might choose a high value for DENSITY_ 
MULTIPLIER and a small value for MAX_EDGE. The 
selection of the default values of these parameters, and the 
values used for the data in this study, are discussed later in 
text under 'Parameter Selection.' 

Once these parameters are set, the processing of reads 
into transcripts proceeds as follows for each strand of each 
chromosome (Figure 2a): 

(1) Given a mapped tag file [in BAM or SAM (24) 
format], each tag is reduced to its genomic coordin- 
ates and loaded into a database table. The tables are 
designed such that the dataset for each sample is 
stored in a separate table. 

(2) Once loaded, the tags from a single sample are 
merged, but no analysis of density is attempted. 



Although individual tag boundaries are not main- 
tained in the merged format, the count of reads 
and number of gaps between the reads that are 
merged are tracked for expression level comparisons 
later. 

(3) The set of unified transcripts from a single sample is 
then merged with transcripts from all existing 
samples. 

(4) Using the stored tag counts and the genomic coord- 
inates of the merged read, each proto-transcript is 
mapped as a horizontal line segment in 2D space, 
with the start and end serving as the coordinates 
along the x-axis, and the density (tags in all runs 
per basepair) as the coordinate along the j'-axis. 
Density is scaled by a parameter (DENSITY_ 
MULTIPLIER); a higher multiplier increases the 
relative importance of density as compared with 
position. 

(5) A second stage of merging begins over the 2D space 
according to the algorithm described later in text and 
the MAX_EDGE parameter. At this stage, several 
optimization checks filter out proto-transcripts that 
are likely noise, such as those that have fewer than 
one tag per sample on average. 

(6) Transcripts are associated with annotation databases 
as described in part 2, later in text. 

(7) Transcripts are scored. Any scoring algorithm could 
be implemented here, but currently two are included: 

• A standard reads per kilobase per milhon tags 
(RPKM) score assignment 

• A custom, length-sensitive algorithm: 

Score = RPKM * logioo[max(l, length - 200)] 

This custom score has several modifications as compared 
with RPKM that make it more sensitive to certain kinds 
of transcripts: 

(1) Short transcripts (<200bp) are set to a score of 
0. This reduces noise from overlaps of several 
reads that get stitched together, and from tech- 
nical artifacts. 

(2) Long transcripts are handicapped. There are 
many long transcripts with low levels of tran- 
scription that are nonetheless interesting 
(Supplementary Figure Sib). RPKM alone has 
a tendency to decrease with transcript length 
(Supplementary Figure S2a), and thus it is 
difficult to filter out short noisy transcripts 
without losing long transcripts for which we 
accept lower levels of transcription. To address 
this problem, the custom score scales the RPKM 
by the log of the length of the transcript, thereby 
levehng out the scores of long transcripts 



Figure 1. Continued 

where to mark transcript boundaries, either including or excluding the run-off at the 3' end of the gene, and thus it is important to have a consistent 
algorithmically determined interpretation. Here, we show that Vespucci is able to either respect the RefSeq boundary (lower blue track) or to identify 
the entire nascent transcript (upper blue track), (e) Neighboring transcription regions can have different read densities. Two transcripts are identified 
along the sense strand, denoted by the blue bars at the top. These two transcripts are close in terms of basepair distance (363 bp apart), but they 
differ in terms of read-per-basepair densities, and therefore are kept as two separate transcripts. 
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Figure 2. Stepwise procedure for assembly of transcripts by Vespucci, (a) (1) Each sample, mapped to the reference genome, is reduced to its 
genomic coordinates and loaded into a separate database table. (2) Short reads from a single run are merged (separated by chromosome). (3) The 
merged proto-transcripts from each individual run are merged with proto-transcripts from other runs. The number of tags from each different run is 
stored. (4) The proto-transcripts from (3) are plotted in 2D space, with location in basepairs along the .v-axis and the density in tags per basepair 
along the r-axis. The density is scaled according to a paraineter, DENSITY_MULTIPLIER, that defines the relationship between the two units of 
measurement (basepairs and tags per basepair). (5) The proto-transcripts in 2D space are then merged according to a MAX_EDGE parameter that 
operates as the maximal allowed Euclidian distance from the rightmost edge of each transcript. The merged transcript here is considered a 

(continued) 
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(Supplementary Figure S2b). We use a logarithm 
with base 100 here to ensure that the score scales 
only minimally over the extremely wide range of 
transcript lengths. 

The choice of using the RPKM or the custom score 
depends largely on use case; if transcripts <200-bp long 
are of particular interest, as might be the case if one were 
studying pause-release mechanisms using GRO-seq (25), 
then it would be advisable to use unmodified RPKM 
instead of the custom score. 

When the processing is complete, the derived transcripts 
can be easily queried, annotated and associated using a 
relational database for storage (Figure 2b). 

Algorithm 

To stitch together continuous proto-transcripts in a 
density-aware manner, we first map the proto-transcripts 
in 2D space: along the x-axis is the start and stop, in zero- 
indexed basepairs of the read, and the j-axis represents 
the mean density of short reads over all samples 
[Figure 2a(4)]. To relate basepairs distance to read 
density, the density is scaled by the DENSITY_ 
MULTIPLIER. This graphical representation allows us 
to define the distance between any two transcripts as 
simply the Euclidian distance in this density-basepair 
plane [Figure 2a(5)]. 

Arguably, it is not necessary to relate density and 
basepair distance in this manner, and an alternate 
distance formula could consider separate thresholds for 
the differences in densities and positions of two proto- 
transcripts. However, such a distance function would 
not account for the biologist's intuition that the closer 
two transcripts are, the more hkely they are to be a 
single unit, even if there is a density difference that 
might at a greater positional difference warrant separation 
of two transcripts. In other words, the difference allowed 
between densities of two proto-transcripts when merging 
is dependent on the basepair distance between the two, 
and thus considering the Euclidian distance is preferable 
to a binary threshold that treats density and basepairs 
independently. 

With a distance function thus specified, we can define 
our algorithm for merging proto-transcripts into 
transcripts: 

(1) We define a graph in which every node is a proto- 
transcript, and a pair of nodes is connected by an 
edge if and only if the distance between the two 
proto-transcripts is less than or equal to the param- 
eter MAX_EDGE. 

(2) Connected components in this graph represent 
merged proto-transcripts. 

(3) Merged proto-transcripts can then be recast as inter- 
vals spanning the minimal basepair start and the 



maximal basepair end. Overlapping intervals are 
merged. 

At the end of this procedure, we have produced a set 
of continuous non-overlapping transcripts that can be 
stored, annotated and so on, as seen in Figure 2a(5)-(7). 

A naive algorithm would be quadratic, comparing every 
node with every other. However, in practice, nodes are 
ordered, and it is only necessary to consider nodes 
within a distance of MAX EDGE. Thus, the algorithm 
can be practically implemented in linear time with respect 
to the number of proto-transcripts. In the current imple- 
mentation, we take advantage of the geometric query 
space in PostgreSQL to limit the search for neighboring 
proto-transcripts to a distance of MAX_EDGE. 

Annotating a transcript 
Using known RefSeq 

The procedure described earlier in text can proceed 
naively — i.e. based entirely on 2D distance between tran- 
scripts and without awareness of existing annotations. In 
practice, it is useful for the implementation of the algo- 
rithm to respect the existing boundaries of genes as 
annotated by RefSeq, as this allows tag counts and 
computed expression values to be relevant in the context 
of the existing literature on gene body-based expression 
comparisons. 

Thus, the current implementation of the algorithm 
makes two important allowances for RefSeq genes. In 
the first, the allowed distance between proto-transcripts 
that is traversed during the 2D merging in step 5 can be 
increased within the boundaries of known RefSeq genes 
such that gaps are more likely to be covered within genes. 
This extra allowance increases the likehhood that long 
low-expression transcripts are recognized as single units 
rather than a series of small gapped transcripts. 

The second heuristic applied to the identification of pre- 
viously annotated transcripts addresses the continuation 
of transcription past the traditional transcription termin- 
ation site. In GRO-seq data, we see clearly that transcrip- 
tion does not always stop at the point corresponding to 
the annotated gene end, but rather continues on for some 
distance (Figure Id). In these cases, we may want to be 
able to compare GRO-seq expression counts in genes to 
the measurements made in previous RNA expression 
studies, and thus force a separation between tags falling 
within RefSeq boundaries and those that extend beyond 
the boundaries, even if the signal is continuous according 
to the general rules of merging outhned earlier in text. 
Vespucci can be configured either to force the transcript 
to be segmented according to RefSeq boundaries (lower 
blue track in Figure Id), so that comparisons can be made 
to more traditional expression data, or to assemble the 
transcript regardless of the annotated RefSeq boundaries 



Figure 2. Continued 

continuous unit of transcription by Vespucci. (6) These transcripts can then be associated with known RNA species from RefSeq and ncRNA 
databases based on genomic coordinates. (7) Transcripts are then scored according to a custoin algorithm or RPKM. (b) Database schema showing 
the Vespucci transcript table structure, major columns and related entities. An asterisk indicates a 'has many' relationship, and ID fields contain 
references to related tables. 
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(upper blue track in Figure Id), so that tiie full nascent 
transcript can be analyzed. In the case where segmentation 
along RefSeq boundaries is forced, the post-gene tran- 
script is linked via an index to its preceding gene tran- 
script, and we have used this option in Vespucci in the 
case study analyses later in text. 

Annotation from known databases 

In addition to segmentation according to known annota- 
tions like RefSeq, it is useful to be able to associate 
the known annotations with the transcripts that overlap 
in genomic space. Thus, the current implementation 
includes logic not only to define transcripts according to 
RefSeq boundaries but also to associate RefSeq identifiers 
with overlapping transcripts strand specifically. Similarly, 
we provide logic and data to make associations with 
non-coding RNA as identified by the Functional RNA 
Project (26). 

Arbitrary data types 

The representation of transcripts in terms of genomic 
coordinates gives power beyond associating with existing 
annotations. Arbitrary data types, such as peaks identified 
in individual ChlP-sequencing experiments, repeat 
regions, conservation scores and ESTs, could all be repre- 
sented in terms of genomic coordinates and used to 
annotate transcripts, either within the existing framework 
or ad hoc. There are many examples included in 
Supplemental SQL queries that collectively demonstrate 
that the power of the current system is its ability integrate 
expression data across many samples with multiple types 
of annotative or associative data based on genomic 
location and distance quickly and easily. 

A case study — transcription in macrophages 

Transcriptional profihng accomphshed by the ENCODE 
project has revealed that about three-quarters of the 
genome is transcribed across fifteen human cell lines (9). 
Using GRO-seq data from five biological replicates, we 
analyzed the characteristics of transcription in murine 
thioglycollate-elicited macrophages using Vespucci, with 
the intent of characterizing the extent of transcription in 
a particular primary cell type under unstimulated 
conditions. 

Parameter selection 

To optimize the selection of the MAX EDGE and 
DENS1TY_MULTIPLIER parameters, we took advan- 
tage of previously pubhshed 5'-GR0-seq data (10), 
which identify nascent RNA with a 5' 7-methylguanylated 
cap. The 5'-GRO-seq method thus produces peaks that 
identify transcription initiation sites of nascent RNAs 
genome-wide. The data available were in RAW 264.7 
cells, which are a macrophage cell line, and thus were 
expected to be compatible with the primary macrophage 
GRO-seq data. Because 5'-GRO-seq identifies transcript 
initiation sites, we would expect transcripts identified by 
Vespucci to have maximally one 5'-GRO-seq peak; having 
more than onewould be an indicator that the Vespucci 
transcript had merged together multiple separate units. 
Conversely, having zero 5'-GRO-seq peaks within a 



Vespucci transcript could indicate that noise was falsely 
assembled into a transcript, that a continuous transcript 
was divided into many transcripts, or that the two 
sequencing techniques differ in sensitivity. We desired, 
therefore, to select parameters that would maximize the 
rate at which identified transcripts corresponded with 
exactly one 5'-GR0-seq peak. Further, to avoid 
advantaging parameters that achieved this higher rate by 
greatly reducing the total number of transcripts, we added 
a penalty for the rate at which transcripts were identified 
with more than one 5'-GRO-seq peak. The resultant 
metric, which we labeled the Initiation Recapture Rate 
(IRR), is defined as: 

(transcripts with one 5'-GR0 peak 
= —transcripts with more than one 5'-GR0 peak) 
total transcripts 

We then tested values of MAX EDGE in the range of 
100-5000 and found that the maximum IRR was achieved 
at a MAX EDGE of 500 (Supplementary Figure Sic). 
Then, holding MAX_EDGE constant at 500, we tested 
values of DENSITY_MULTIPLIER in the range of 
1000-100000 and found that the maximum IRR 
was achieved at 10000 (Supplementary Figure Sid). 
Thus, we selected a MAX EDGE of 500 and a 
DENSITY_MULTIPLIER of 10000 for the current 
study and as the default parameters. Notably, as discussed 
later in text, these values perform well when used with 
human MCF-7 data as well, implying that the currently 
selected values are applicable to a variety of experimental 
data sets. 

Identification of RNA species 

We proceeded to analyze the murine macrophage 
GRO-seq data with a MAX_EDGE of 500 and a 
DENSITY_MULTIPLIER of 10000. Using these 
values, we see 11% of the sense strand (294 363 940/ 
2 620 345 972 bp) and 11% of the antisense strand 
(282 540 749/2 620 345 972 bp) being actively transcribed 
in basal conditions. 

These regions of transcription across the genome can 
then be inspected further. Using the unstimulated data, 
the total number of transcripts passing the minimal thresh- 
old to progress from proto-transcripts into the secondary 
transcript database is 84076; of these, 34743 (41%) had a 
score (as defined by the custom method described in step 7 
of the procedure earlier in text) of at least 1. The score 
threshold best suited for analysis depends heavily on 
intent; if the user is interested in transcripts that are tran- 
sient or have low expression, setting a lower threshold at 
the risk of introducing some noise may be advised. On the 
other hand, the HMM described by Hah et al. (2) resulted 
in only 22 893 transcripts in a human cell line; if the user 
desires a comparable high-threshold analysis with 
annotated regions making up ~50% of the transcripts 
identified, a higher score threshold can be used. 

Of these ~35 000 transcripts, only 8742 (25%) 
overlapped with RefSeq genes such that the gene was at 
least half transcribed. A further 1573 (5%) overlapped 
with RefSeq genes (same-strand) but covered less than 
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half of the gene. In all, 8079 (23%) transcripts overlapped 
with annotated ncRNA (Figure 3a). 

The remaining 21916 transcripts (63%) were not 
annotated by RefSeq or the ncRNA.org database. These 
unannotated transcripts comprised a large proportion of 
the total. Of the unannotated set of transcripts, 12 042 
(55%) were within 1 kb of a RefSeq transcript, of which 
5216 (43%) were specifically within 1 kb of an active 
RefSeq transcription start site (TSS), antisense the 
RefSeq transcript, and thus warranted labeling as 
promoter-associated RNA. The 2955 transcripts (25%) 
were antisense of transcribed RefSeq transcript bodies; 
these include intragenic enhancers and long ncRNA 
(Supplementary Figure S2c). 

Of the 21 916 unannotated transcripts, 9874 (45%) were 
> 1 kb away from any RefSeq transcript, and were thus 
labeled as distal transcripts. It has been estabhshed that 
enhancer elements in the genome are marked by unique 
histone methylation patterns (27,28) — namely, high levels 
of H3K4mel and H3K4nie2 but low levels of H3K4me3— 
and further are actively transcribed, generating transcripts 
(eRNAs) (23). To assign putative labels for distal tran- 
scripts, peaks called by HOMER (7) from H3K4mel 
ChlP-sequencing in unstimulated macrophages were 
loaded into the database and queried. In all, 6211 (63%) 
of the distal transcripts overlapped with H3K4mel peaks 
and were labeled eRNA. 

The remaining 3663 transcripts — 37% of distal tran- 
scripts and 13% of all transcripts — had no label. Closer 
inspection of this subset revealed that 867 of the 
unannotated distal transcripts were within 2kb of a 
H3K4mel peak. Interestingly, many of these appeared 
to be regions of transcription between clusters of enhan- 
cers (Figure 3b) or enhancer-associated RNA extending 
far past the range of the histone mark (Figure 3c). 
Taken together, these results imply that the amount of 
transcription attributable to enhancers is greater than cur- 
rently accounted for by analyses looking only at regions 
directly overlapping associated histone marks. 

In addition to these general categories of unannotated 
transcripts, there were some transcripts in this remainder 
set that were intriguing anomahes. For example, there was 
a 100 kb + region directly downstream of RefSeq gene 
Gm 14461 on chromosome 2 that exhibited active tran- 
scription, but was entirely unannotated by RefSeq, 
ncRNA.org or known mouse expressed sequence tags 
(ESTs) (Figure 3d). Transcription throughout this region 
was not continuous (Figure 3d, inset), and further there 
were several stretches of repeats that prohibited unique 
mapping of tags. Thus, the region was segmented into 
numerous blocks of transcription. Nonetheless, the iden- 
tification of such regions demonstrates the importance of 
closer inspection of GRO-sequencing data for the 
purposes of finding uncharacterized transcripts, as well 
as the value of the database described here in building 
these types of transcripts from merged units. 

Transcription does not stop at RefSeq termination sites 

Particularly interesting to us was the set of transcripts that 
continued past the annotated 3' ends of RefSeq transcripts. 



Closer inspection of this subset revealed that the vast 
majority (7346; 84%) of the -8700 expressed RefSeq tran- 
scripts did not terminate at the annotated TTS, but instead 
continued for some length afterward (Figure Id). The 
expression levels of these 'post-gene RNAs' were well 
correlated with the RefSeq transcripts they followed 
(Figure 4a), but the lengths of the post-gene RNAs were 
not determined by the lengths of the associated RefSeq 
transcripts (Supplementary Figure S3a) or the expression 
level of the associated RefSeq transcripts (Figure 4b, 
Supplementary Figure S3b). 

We next sought to determine why 16% of the expressed 
RefSeq transcripts did not continue past the annotated 
TTS. Of the 1396 RefSeq transcripts with no associated 
post-gene transcripts, 157 (11%) were labeled as rRNA 
rather than mRNA by RefSeq. The remaining mRNA 
had significantly lower expression levels than the set of 
RefSeq transcripts with post-gene RNA (Figure 4c, 
Supplementary Figure S3c), and transcription of these 
genes often did not reach the annotated TTS at all 
(Figure 4d). 

Given this difference in expression level, we next filtered 
the set of ~8700 expressed RefSeq transcripts down to the 
set of 6913 mRNA transcripts that do not stop before the 
annotated 3' end of the gene (84% of the 8231 expressed 
RefSeq mRNAs). Remarkably, 6305 (91%) of this set had 
associated post-gene RNA, indicating that the annotated 
TTSs of RefSeq genes greatly underestimate the extent of 
RNA transcription at these sites. 

Confirming results in human cells 

To confirm the extensibility of the results obtained in the 
macrophage data, we used Vespucci to analyze human 
GRO-sequencing data from MCF-7 cells from two 
separate studies (2,12). We used the same parameter 
values to ensure that the default values selected were not 
applicable only to murine data. In the human cell hne, a 
higher percentage of transcripts were unannotated than 
in the mouse ceUs (Supplementary Figure S2d). The dis- 
tribution of types of unannotated transcripts was surpris- 
ingly similar between the two MCF-7 ceU studies 
(Supplementary Figure S2e, left versus right panels). 
Fewer transcripts were called as eRNA as compared 
with the murine data; this is most hkely due to the fact 
that there is relatively little histone data available in MCF- 
7 cells, and the publicly available H3K4me2 data used 
here (19) was less deep than the mouse H3K4mel data 
used earlier in text. A larger fraction of the human 
unannotated transcripts remained unassigned to a 
known category of RNA. Manual inspection of these 
transcripts revealed that many overlapped with LINE, 
SINE and LTR elements identified by the RepeatMasker 
database (Repeat Library 20120124, accessed at http:// 
www.repeatmasker.org). We used Vespucci to annotate 
the remaining transcripts that overlapped with LINE 
elements and found that more than half of the remaining 
transcripts occurred at LINE elements. This corroborated 
recent reports of widespread transcription at retrotrans- 
posons being associated with oncogenesis (29-31). As a 
whole, these results indicated that both the default 
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Figure 3. Vespucci enables the identification and quantification of numerous RNA species in macropliages. (a) Using a score tlireshold of 1, tlie 
great majority (63%, left panel) of transcripts identified are not associated with known RefSeq genes or ncRNA. Of the unannotated set (right 
panel), more than one half are proximal to RefSeq genes, with the remainder being distal, (b) Transcripts are interspersed not only overlapping with 
the enhancer histone mark H3K4mel but also between enhancers, indicating that complex regulatory regions undergo a great deal of active 
transcription spread over many kilobases. (c) Similarly, transcripts can extend a long distance beyond identifying histone marks at enhancers, 
with this itergenic region showing low levels of H3K4mel and GRO-seq signal extending along a single strand for >5kb beyond an identified 
H3K4mel peak, (d) Vespucci identifies a long unannotated transcript downstream of Gml4461. Vespucci does not merge the entire region, but, with 
a gap parameter of 100 bp, separates it into several long regions with many shorter regions interspersed throughout. Closer inspection (inset) shows 
that the boundaries determined by Vespucci reflect real discontinuities in the GRO-seq signal that will require further study to interpret. H3K4mel is 
shown on the lower track to indicate that this transcript is methylated at the 5' end, much as a protein coding gene would be. 
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Figure 4. Transcription continues past the annotated 3' ends of most genes, (a) The expression levels of transcripts immediately following the 3' ends 
of RefSeq sequences are correlated with those of the preceding RefSeq transcripts as measured with Vespucci scores, (b) The length that transcription 
carries past the 3' end has a weak but positive correlation with the expression level of the preceding RefSeq transcript as measured with Vespucci 
scores, (c) The 16% of RefSeq transcripts are not found to have post-gene RNA according to Vespucci. These RefSeq transcripts tend to have much 
lower expression levels as measured with Vespucci scores than the 84% of transcripts that do continue past their annotated 3' ends, (d) In addition to 
having low expression levels, many of the RefSeq transcripts without post-gene RNA are notable in that the transcript called by Vespucci does not 
reach the annotated 3' end of the gene, as is the case with the Ube2w gene here. 



parameter values set in Vespucci and the analysis per- 
formed for mouse macrophages earlier in text could be 
repeated in data from multiple cell types, species and labs. 

Benchmarking Vespucci 
RefSeq benchmarking 

The two exceptions made for RefSeq annotations noted 
above allow for consistency with the widely maintained 
standard of counting tags over RefSeq regions. We 
compared the shared set of RefSeq transcripts identified 
by Vespucci with those identified by the 'analyzeRNA' 
method available in the HOMER (7) software package 



(Figure 5a) and found a high degree of correlation 
(r = 0.92). There are two systematic differences that 
account for the tag count discrepancies between the two 
data sets: HOMER sums tags for each RefSeq transcript 
separately, whereas Vespucci stitches over overlapping 
genes and isoforms and assigns the total tag count for 
the longest joined transcript to each associated RefSeq 
transcript (Figure 5b and c); and HOMER does not 
require continuity across long transcripts, and conse- 
quently counts tags that are missed by Vespucci when 
genes are too sparse to be adequately stitched together 
(Figure 5d). Notably, these discrepancies primarily affect 
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Figure 5. Vespucci retrieves RefSeq expression levels without losing non-coding RNAs. (a) RefSeq identifiers can be used to compare tfie tag counts 
determined by Vespucci at RefSeq genes with the tag counts determined by the HOMER software, which uses a gene-centric approach to sum GRO- 
seq tags over known genes. The correlation between tag counts is generally good, with deviations from the diagonal attributable to three primary 
categories of transcripts: (b) Vespucci does not segment transcripts at alternative isoforms, but returns the tags for the whole transcript for each 
contained isoform. In contrast, HOMER tallies tags within the precise boundaries of each isoform, resulting in discrepancies between the two 
methods at shorter isoforms, such as the Sppl gene seen here; (c) as with inultiple isoforms, overlapping genes are not segmented by Vespucci, and 
the tag count for the entire transcript covering Macfl is associated with the short gene that is overlapping, D83003IN03Rik; and (d) genes that have 

(continued) 



Nucleic Acids Research, 2014, Vol. 42, No. 4 2445 



transcripts that are difficult to interpret using GRO- 
sequencing data, as it is unclear how to divide up tag 
counts across overlapping transcripts (Figure 5b and c) 
or long, low-level transcripts (Figure 5d). Thus, with 
these two exceptions made for RefSeq transcripts, 
Vespucci produces transcripts comparable to existing 
annotations and methods of analysis. 

Benchmarking against a HMM 

Hah et al. (2) describe an HMM that determines regions 
of transcription from GRO-seq data using a two-state 
model. To assess the relevance of the transcripts found 
by Vespucci to those found by the Hah et al. HMM, we 
trained an HMM using the macrophage GRO-sequencing 
data described earlier in text. Parameters were optimized 
based on the prescribed procedure for the HMM, which 
relies on the sum of two errors: (i) the fraction of RefSeq 
transcripts that are broken apart by the called GRO-seq 
transcripts (Supplementary Figure S4a) and (ii) the 
fraction of GRO-seq transcripts that merge two or more 
RefSeq transcripts (Supplementary Figure S4b). Using 
these criteria, the minimum summed error (12.7%) was 
achieved with a negative log transition probability of 
100 and a shape parameter of 5. These parameters were 
used in the model compared directly with Vespucci, 
though similar results were achieved using the parameters 
selected by the optimization performed by Hah et al. (200 
and 5, respectively). 

We calculated the Vespucci's summed error according 
to the procedure used for the Hah et al. HMM and found 
the error to be 1.8%, or one-sixth of the comparable error 
for the HMM. This lower error serves to underHne the ad- 
vantage of using RefSeq boundaries to inform transcript 
identification, as it prevents Vespucci from breaking apart 
or merging together known regions of transcription. 
Notably, if Vespucci is used with the same parameters 
but without any prior knowledge of RefSeq regions, the 
summed error is about three times that of the optimized 
Hah et al. HMM (37.2%). This higMights both the advan- 
tage Vespucci gains by integrating with existing databases, 
and the fact that the default parameter set is designed to 
avoid overmerging unannotated regions of transcription. 
If a user desires to retrieve RefSeq transcripts without 
prior knowledge of RefSeq, a larger MAX EDGE param- 
eter may be used to achieve a lower error. With the macro- 
phage data and no assumption of RefSeq boundaries, 
Vespucci with a MAX_EDGE parameter of 5000 yielded 
a summed error rate of 12.2% (Supplementary 
Figure S4c), just below that achieved with the Hah et al. 
HMM. 



Given the designed use of Vespucci, the real question 
of performance comes with transcript calHng over 
unannotated regions of transcription. Whereas Vespucci 
identifies 24428 transcripts above a threshold score of 1 
that do not overlap same strand with a RefSeq transcript, 
the HMM identifies 9374. Closer inspection of this dis- 
crepancy reveals that the HMM is more likely to merge 
together transcripts Vespucci calls as separate (Figure 5e) 
and less Hkely to call transcripts when GRO-seq expres- 
sion levels are low (Figure 5f). Further, we compared the 
calls made by Vespucci and the HMM with the previously 
pubhshed 5'-GRO-seq data (10), and Vespucci more ac- 
curately captured the multiplicity of short transcripts 
associated with distinct transcripts than the HMM 
(Figure 5e and f). To quantify this merging or missing of 
transcripts by the HMM, we calculated the same two error 
rates described earlier in text for the HMM as compared 
with Vespucci, and found that 35.4% (7026) of the HMM 
transcripts are broken up by Vespucci transcripts, whereas 
1.3% (451) of Vespucci transcripts are broken up by 
HMM transcripts. Notably, changing the parameters of 
the HMM might result in higher sensitivity identification 
of transcripts, but only at the expense of reliable calhng of 
RefSeq genes. 

DISCUSSION 

GRO-sequencing reveals transcriptional dynamics at a 
genome-wide scale, and thus has the power to give 
unique and novel insight into the regulation of cellular 
processes. Taking full advantage of this new data source 
requires combining disparate data sets and identifying 
within them transcripts of interest. The system introduced 
here makes this possible by providing a framework for 
analyzing GRO-sequencing data at both a general level 
and in great detail. Further, Vespucci allows for easy in- 
tegration of many different types of sequencing data, 
which, when taken together, greatly increase the informa- 
tion gained from each single data type. 

In this study, we apply Vespucci to annotate nascent 
RNA transcripts defined by GRO-sequencing data 
obtained from primary mouse macrophages and a 
human breast cancer cell fine. This analysis yields a com- 
prehensive list of contiguous nascent transcription units 
derived from both promoters and enhancers throughout 
the genome. The Vespucci output provides genomic 
location, score, nearest gene and expression level in 
various sequencing runs of interest. By enabhng the quan- 
tification of GRO-sequencing data, we add it to the set of 
sequencing-based methods that can be rehably leveraged 



Figure 5. Continued 

few dispersed tags that cannot be adequately merged yield several smaller transcripts according to Vespucci, whereas HOMER implicitly joins them 
and counts all that fall along the body of the gene regardless of continuity of transcription, (e) The HMM described by Hah el al. identifies 
transcripts using a two-state model that calls regions transcribed (black bars) or untranscribed. The HMM identifies many fewer transcripts than 
Vespucci, in part, because it merges together transcripts called as distinct by Vespucci. Here, three pairs of bi-directional RNAs that are identified as 
two single units by the HMM. The bottom track shows data from previously published 5'-GRO-seq, a method that detects nascent RNA with a 5' 
7-methylguanylated cap. This method identifies start sites of nascent RNAs genome-wide. The data here, from RAW macrophages, show that 
Vespucci captures more accurately the separately initiated transcripts, (f) Similarly, some transcripts are called by Vespucci at expression levels too 
low for the HMM. Here, a paRNA is identified by Vespucci but not the HMM. The bottom track again shows 5'-GRO-seq from RAW macro- 
phages, where the paRNA start site can be clearly seen. 
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to investigate a wide array of biological questions. In the 
current study, we demonstrate the use of Vespucci to 
identify novel transcripts of interest, such as the long 
non-coding transcript near Gm 14461, and characterize 
the length and expression values of enhancer-associated 
RNAs. As each cell type contains a specific complement 
of enhancers that specify its identity and functional poten- 
tial, Vespucci will be a valuable tool for annotation of cell- 
specific eRNAs. In addition, Vespucci quantifies the 
extent of nascent transcription beyond the annotated 
3' ends of genes defined by the site of polyadenylation. 
This information may be useful in evaluating mechanisms 
and regulation of transcriptional termination. 

One shortfall of the current system is that the parameter 
defining acceptable gap distance between reads associated 
with the same transcript must be set heuristically, depend- 
ent on the needs of the user. Ideally, the parameters to 
identify transcriptional units from reads would be set to 
minimize errors against a gold standard of transcriptional 
units. At this time, no such standard for GRO-sequencing 
data exists. In the current study, we were able to approxi- 
mate a gold standard using 5'-GR0-sequencing data, and 
thus with Vespucci we hope to take the first step toward 
defining such a gold standard by providing a method and 
a framework for transcript identification. 

Vespucci extends beyond GRO-sequencing data, too; 
once the database is set up, it is straightforward to add 
data from ChlP-sequencing runs, external databases, 
known motifs, single nucleotide polymorphisms or any 
other data of interest that can be expressed within 
genomic coordinate space. In the current study, we 
demonstrated the integration of data on retrotransposons 
with the use of LINE data in analyzing MCF-7 cells. 
Similarly, one might integrate data on repeat regions 
and mappability; it is possible to load in genomic coord- 
inates of regions of the genome that preclude uniquely 
mapped reads and then allow the merging of transcripts 
to automatically ignore those regions. In the current im- 
plementation, we do not include this functionality, as it 
was found to yield too many spurious results. However, if 
a particular apphcation prefers inclusiveness in transcript 
merging, automatically covering across repeat regions can 
be incorporated into the system. This is just one example 
of the extensibihty of the system, demonstrating 
that Vespucci aUows for integration of many types of 
genomic data and sequencing samples, making more 
feasible analyses that cut across the whole breadth of 
samples and data sets available to a laboratory. 
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